Predicting Quality of Portuguese Vinho Verde White Wine
Group 009E05
Mason Feng
The University of Sydney
Model Selection
In order to create our model, we must first choose our predictor variables. We will compare two methods:
Stepwise Selection
Lasso Regression
We will compare these models using \(RMSE\), \(MAE\), \(R^2\), \(AIC\) and \(BIC\). From our EDA we noticed that both “alcohol” and “residual.sugar” are highly correlated with “density”. This means we need to be wary of potential multicollinearity, and we may want to remove “density” from our model.
What is multicollinearity?
Multicollinearity means that our predictors are correlated with one another. We want to reduce this in our model, which we will do by checking the Variance Inflation Factor (VIF) of our variables.
Stepwise Selection
For this method we will compare both forward and backward selection. We know from lectures that stepwise selection uses AIC as its performance metric. Let’s perform stepwise selection and view our performance metrics:
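A sketch of how the stepwise fits can be run with base R’s `step()`. The built-in `mtcars` data (response `mpg`) stands in here for the wine data so the snippet runs as-is; in the actual analysis the wine data frame and `quality` would take their place.

```r
# Forward and backward stepwise selection via base R's step(),
# which adds/drops terms while AIC keeps improving.
# mtcars/mpg stand in for the wine data and "quality".
full_model <- lm(mpg ~ ., data = mtcars)
null_model <- lm(mpg ~ 1, data = mtcars)
forward  <- step(null_model, scope = formula(full_model),
                 direction = "forward", trace = 0)
backward <- step(full_model, direction = "backward", trace = 0)
c(AIC(forward), AIC(backward))  # compare the two selected models
```

Because `step()` only moves when AIC improves, the backward-selected model’s AIC can never exceed that of the full model.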
We notice from the previous slide that forward and backward selection choose the same model, which results in identical performance metrics. However, we haven’t yet dealt with the issue of multicollinearity, which we will now check using VIF.
Dealing with multicollinearity
VIF is a measure of multicollinearity in the model, where higher values signify higher correlation (which we want to avoid!). The formula is as follows:
\[VIF_i = \frac{1}{1-R^2_i}\]
Generally we want to ensure that every variable has a \(VIF\) that is \(<5\). We use the vif() function from the car library on our model object:
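To see the formula in action, we can compute a VIF by hand: regress predictor \(i\) on the remaining predictors and take \(1/(1-R^2_i)\). The data below are synthetic, made up purely to illustrate the calculation.

```r
# Manual VIF: regress one predictor on the others, then 1/(1 - R^2).
# Synthetic data for illustration only.
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- 0.9 * x1 + 0.1 * rnorm(100)   # x3 is nearly collinear with x1
r2_x3  <- summary(lm(x3 ~ x1 + x2))$r.squared
vif_x3 <- 1 / (1 - r2_x3)
vif_x3                              # far above the rule-of-thumb cutoff of 5
```

With a fitted model object, `car::vif(model)` performs this calculation for every predictor at once.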
We observe that we have three variables that have \(VIF>5\): “alcohol”, “residual.sugar” and “density”, which is as expected from our EDA. To reduce our multicollinearity, we can remove variables with high multicollinearity, but even then, stepwise selection has a multitude of statistical problems. What if there was a better way of variable selection?
Lasso Regression
Least Absolute Shrinkage and Selection Operator, or LASSO, is a regression method utilising \(\ell_1\)-regularisation, where the parameters of the regression model are:
\[\beta^{lasso}_\lambda = \underset{\beta}{\operatorname{\arg\min}} \Biggl\{ \underbrace{\sum_{i=1}^n\Biggl( y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\Biggr)^2}_{\text{Residual Sum of Squares}\; (RSS)}+\lambda\sum_{j=1}^p|\beta_j| \Biggr\}\]
LASSO is often used when we also need to do variable selection, since it tends to set some coefficients exactly to 0 for large enough values of \(\lambda\). Therefore, we can use LASSO to perform our variable selection and modelling at the same time. We also worry less about multicollinearity, as LASSO tends to shrink the coefficients of redundant, correlated variables towards (or exactly to) 0.
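A quick way to see why LASSO can zero out coefficients: its coordinate-wise update applies the soft-thresholding operator, which maps any value within \(\lambda\) of zero exactly to zero. A minimal base-R sketch:

```r
# Soft-thresholding, the coordinate-wise update behind LASSO:
# S(z, lambda) = sign(z) * max(|z| - lambda, 0)
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

soft_threshold(c(-2, -0.5, 0.3, 1.5), lambda = 1)
# -1.0  0.0  0.0  0.5  -- values inside [-1, 1] are set exactly to 0
```

Larger \(\lambda\) widens the zeroed-out band, which is why increasing \(\lambda\) drops more variables from the model.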
Main idea behind LASSO
The main idea of LASSO is that we introduce a small amount of bias into the way we fit our model, and in return for that small amount of bias we get a significant drop in variance.
Recalling the LASSO regression formula, we see that we need to choose a suitable hyperparameter \(\lambda\), which is done through cross-validation, e.g. 10-fold CV.
Performing LASSO regression
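Unlike `lm()`, glmnet takes a numeric predictor matrix rather than a formula, so `x` and `y` are built first, presumably along these lines. `mtcars` and `mpg` stand in here for the wine data and `quality` so the snippet runs as-is.

```r
# Building the inputs cv.glmnet expects: a numeric matrix x and a vector y.
# mtcars/mpg stand in for the wine data and "quality".
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # drop the intercept column
y <- mtcars$mpg
```

`model.matrix()` also expands any factor columns into dummy variables, which glmnet requires.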
set.seed(2002)
cv_lasso = cv.glmnet(x, y, alpha = 1, standardize = TRUE, nfolds = 10)
plot(cv_lasso)
Hyperparameter choice from CV
cv_lasso$lambda.min
[1] 0.002537796
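`lambda.min` is the \(\lambda\) with the lowest cross-validated error. The `cv.glmnet` object also records a more conservative choice, `lambda.1se` (the largest \(\lambda\) within one standard error of the minimum), which typically yields a sparser model; a sketch using the fitted object above:

```r
cv_lasso$lambda.1se                # larger lambda, within 1 SE of the minimum
coef(cv_lasso, s = "lambda.1se")  # usually fewer non-zero coefficients
```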
Selected coefficients for our LASSO model
coef(cv_lasso)
12 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 2.732099287
fixed.acidity -0.039233911
volatile.acidity -1.750671036
citric.acid .
residual.sugar 0.015682518
chlorides -0.547985113
free.sulfur.dioxide 0.002585403
total.sulfur.dioxide .
density .
pH 0.035647346
sulphates 0.202078813
alcohol 0.335047738
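With the model fitted, the metrics listed at the start (\(RMSE\), \(MAE\), \(R^2\)) can be computed from its predictions. A sketch using the `cv_lasso` object and the `x`, `y` passed to `cv.glmnet` (ideally this would be done on held-out data rather than the training set):

```r
# Score the LASSO fit at the CV-chosen lambda.
pred <- as.vector(predict(cv_lasso, newx = x, s = "lambda.min"))
rmse <- sqrt(mean((y - pred)^2))
mae  <- mean(abs(y - pred))
r2   <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)
c(RMSE = rmse, MAE = mae, R2 = r2)
```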